Fix process and file descriptor leaks in Salt Master by dwoz · Pull Request #69227 · saltstack/salt

dwoz · 2026-05-26T05:49:54Z

Ensure proper resource lifecycle management and process reaping to resolve leaks introduced between 3006.20 and 3006.25.

Call wait() after kill() in TimedProc to prevent zombie processes.
Implement context manager protocol and destroy() in SaltEvent, RunnerClient, WheelClient, and MasterMinion.
Update masterapi.py to ensure RunnerClient is used within a with statement.
Explicitly destroy persistent objects in RemoteFuncs and LocalFuncs during teardown.
Initialize internal attributes to None and fix variable scope issues to achieve 10/10 pylint rating.

dwoz · 2026-06-08T06:22:32Z

Heads up: 30 integration shard 4-6 failures across distros (zeromq/tcp) at 29ea4d6 — same pattern as the earlier 57, same root cause: MWorker._handle_payload destroys aes_funcs/clear_funcs mid-flight on the 100th request while the tornado coroutine still holds futures bound to those handlers. Sampled tests/pytests/integration/proxy/test_deltaproxy.py etc; FactoryTimeouts + stuck "state.highstate is running as PID..." on every distro. Recycling block is correct in intent but needs coroutine drain + atomic swap before it's safe. Branch is rebased onto origin/3006.x; let me know if you want it removed again or left for your follow-up.

Ensure proper resource lifecycle management and process reaping to resolve leaks introduced between 3006.20 and 3006.25. - Call wait() after kill() in TimedProc to prevent zombie processes. - Implement context manager protocol and destroy() in SaltEvent, RunnerClient, WheelClient, and MasterMinion. - Update masterapi.py to ensure RunnerClient is used within a with statement. - Explicitly destroy persistent objects in RemoteFuncs and LocalFuncs during teardown. - Initialize internal attributes to None and fix variable scope issues to achieve 10/10 pylint rating.

Resolve process, file descriptor, and memory leaks by adding explicit teardown and context managers for clients, periodic worker cache resets, and optimizing token listing with os.scandir.

Include a GitHub Actions workflow that performs aggressive stress testing on Master, Minion, and API components, with automated Prometheus-based regression analysis to detect resource leaks.

Address pylint warnings in analyze_stats.py and fd_exporter.py related to unused imports, broad exceptions, and resource management.

Enable post-build introspection by snapshotting the Prometheus data directory and uploading it as a GitHub Action artifact.

Explicit resource cleanup is preferred; __del__ methods introduced during memory and file handle leak fixes have been removed from: - MasterMinion - RunnerClient - SaltEvent - WheelClient

Close the underlying PyZMQ monitor socket explicitly in `ZeroMQSocketMonitor.stop()` to prevent eventfd accumulation during repeated Minion connection failures. Fixes test_fd_leak.py

…ripts - rest_cherrypy Login.POST referenced self.api which was removed from LowDataAdapter; wrap _is_master_running in a NetapiClient with-block. This fixes HTTP 500 on /login in functional/integration tests. - store_job now uses 'with MasterMinion(...)'; MockMasterMinion and MockNetapiClient need __enter__/__exit__. - Exclude tests/monitoring scripts from test_module_names check.

The per-100-request recycling destroyed aes_funcs/clear_funcs (channels, event, ckminions, loadauth, masterapi) on a live worker while requests were still in flight, silently breaking publish/return on heavier integration shards (4-6) across every distro. Removed the block; the existing destroy() paths on shutdown handle cleanup correctly.

- Restored worker_resource_backcount logic in MWorker to prevent memory accumulation in long-running clear/aes functions. - Increased flat memory buffer in test_publisher_mem to prevent false positive failures from PyZMQ/Python memory fragmentation.

NetapiClient.runner now uses RunnerClient as a context manager (salt/netapi/__init__.py). The four regression tests in test_netapi_client_runner.py patched salt.runner.RunnerClient with a FakeRunner that lacked __enter__/__exit__, so every distro on unit zeromq shard 3 failed with AttributeError: __enter__. Add no-op __enter__ returning self and __exit__ returning False to each FakeRunner so the with-block in NetapiClient.runner works against the test double.

Address several long-standing memory leaks discovered during stress testing of the salt master and minions. salt/daemons/masterapi.py + salt/master.py Maintenance loader caching. clean_expired_tokens() and clean_old_jobs() now accept optional loadauth/mminion arguments so callers can reuse a long-lived instance. Maintenance.__init__ caches one LoadAuth and one MasterMinion in _post_fork_init and reuses them every iteration, destroying both in destroy(). Without this each Maintenance loop iteration constructed fresh LoadAuth + MasterMinion instances, triggering a fresh LazyLoader + __virtual__ cascade + module-load chain whose bytecode/dict/string allocations were retained in sys.modules. This was the dominant driver of the Maintenance-process slow drift (~2.4 MB/min) — now flat. salt/transport/frame.py + salt/transport/ipc.py 4-byte big-endian length prefix on frame_msg_ipc, and matching exact-length readers in IPCServer.handle_stream and IPCMessageSubscriber._read (drops the streaming msgpack Unpacker). The Unix-domain-socket atomic-write boundary (~65 536 bytes) was causing concurrent large writes (e.g. beacon status frames + flood events) to interleave, leaving the streaming Unpacker desynchronised and producing UnicodeDecodeError / ExtraData crashes in EventReturn and any other long-running subscribers. With explicit framing the receiver always knows where one message ends and the next begins. salt/transport/ipc.py IPCMessagePublisher._write converted from a @gen.coroutine to a regular function with a done-callback. Each published message was spawning a long-lived gen.Runner per subscriber stream that waited inside the stream.write yield until the OS drained the bytes. Under high event rates the Runner / generator / frame / Future quadruple was the dominant residual minion leak (905+ Runners observed). Now the callback fires asynchronously without spawning a coroutine. salt/transport/zeromq.py Three ZMQ callback-registration sites (RequestServer, PublishServer pull_sock, and PublishClient.on_recv) now wrap the Tornado @gen.coroutine handler in a _dispatch shim that routes through io_loop.spawn_callback() and returns None to PyZMQ. Previously PyZMQ's _run_callback wrapped any Awaitable return value with asyncio.ensure_future(), creating Tasks on the asyncio event loop that was never driven by the Tornado IOLoop — Tasks (plus their gen.Runner / Future / WeakRef tracking sets) accumulated indefinitely. The minion-side fix (PublishClient) alone removed ~18,800 Tornado Runners observed under stress. salt/utils/event.py Three independent hardening changes: - SaltEvent._get_event now catches SaltDeserializationError so a single malformed/corrupted IPC frame can no longer kill the entire subscriber loop. - EventPublisher.run installs a 5-minute PeriodicCallback that calls libc.malloc_trim(0) to release glibc arena pages. High- throughput event publishing fragments the allocator heavily; the EventPublisher's RSS routinely sat at >1 GB of free-but-unreturned pages without this. - EventReturn now validates configured event_return returners at startup (emitting a clear one-shot error if anything is missing) and rate-limits the per-event "returner not found" message to once per 60 s per returner. With event_return_queue=0 the previous code emitted that error for every single event, which could fill log volumes in minutes under stress. salt/minion.py Periodic gc.collect() PeriodicCallback in Minion.tune_in. Tornado coroutine timeouts (FutureWithTimeout, Runner.handle_yield closures, traceback objects, etc.) create reference cycles that Python's default GC thresholds (700, 10, 10) collect too rarely for the rate at which they accumulate in a busy minion. Running a full collection every 60 s keeps the working set steady.

…n, dashboard Improvements to the tests/monitoring stack so the salt master/minion leak hunt is reproducible and observable. tests/monitoring/Dockerfile.salt Rebuild Python 3.10.20 from source with CFLAGS="-g -O2 -fno-omit-frame-pointer" so memray's --native frame unwinder and gdb can resolve CPython symbols (the stock python:3.10 slim image strips them). Adds gdb and memray at image-build time so attaching a profiler to a long-running process no longer requires apt-get + pip inside the container. Also fixes the stale requirements/base.in -> .txt path. tests/monitoring/master.conf + tests/monitoring/minion.conf Set ipc_write_buffer: 104857600 (100 MB) on both master and minion to cap Tornado IOStream._write_buffer growth on the local event-bus IPC publisher. Without this, one slow subscriber on either side caused a single bytearray to grow unbounded under stress (>1 GB observed on master EventPublisher, ~80+ MB/process and climbing on minions). Switch minion log_level from debug -> warning; debug logging in long-running container stress runs filled tens of GB of Docker JSON logs per minion. tests/monitoring/prometheus.yml Move the salt-fds target to port 8002 to match fd_exporter.py's listen port. tests/monitoring/stress_api.sh Drop the per-iteration "frequent logins" call. Hammering /login 10x/sec generated a CherryPy session per request, which inflated the salt-api Netapi process to >1 GB of session state — not a salt bug, just an unrealistic stress pattern. Real clients reuse one token. tests/monitoring/grafana/.../salt_monitoring.json Add a Current Time stat panel (uses Prometheus time() so the dashboard prominently shows when "now" is during long captures), default to a 30-minute window with 10-second auto-refresh, and honour the browser timezone. Reshapes the row heights to make room.

# Conflicts: # salt/utils/event.py

The maintenance loader caching work (commit d4e2e07) moved cached LoadAuth + MasterMinion instance construction into Maintenance._post_fork_init and made the main loop reference self._cached_mminion / self._cached_loadauth. test_run_func mocks _post_fork_init wholesale, so those attributes never get set, and Maintenance.run() now raises AttributeError on the first iteration. Have the mocked _post_fork_init seed both attributes with MagicMock so the loop body still has something to call.

The earlier IPC framing rewrite collapsed _read to a single read. read_async, however, calls _read(None, callback) once and expects the coroutine to loop forever invoking the callback on every incoming message, the same shape the streaming-Unpacker version had via its `while True:` outer loop. With the loop gone, every subscriber registered via SaltEvent.set_event_handler delivered exactly one event and then went deaf. On the minion that breaks the `__master_req_channel_payload/<id>/<master>` handler, so command returns never reach the master and `salt '*' ...` reports "Minion did not return [No response]". Restore the `while True:` loop, breaking out only when no callback was supplied (one-shot read) or the stream closes / times out. Drop the `timeout` after the first length prefix arrives so the payload read is not artificially constrained.

CI run 27918914154 surfaced two regressions introduced by d4e2e07 ("Fix multiple memory leaks ..."). 1. EventPublisher hardcoded ctypes.CDLL("libc.so.6") The malloc_trim PeriodicCallback was glibc-only and raised OSError on macOS and Windows where libc.so.6 does not exist. The EventPublisher process crashed at startup and was restarted in a tight loop by the SignalHandlingProcess parent, so the master fixture never became fully usable and every test in every chunk that depends on a real master failed with FactoryNotStarted. malloc_trim was never a real leak fix to begin with -- it only released free()'d glibc arena pages back to the OS to make RSS look smaller on graphs; glibc would have re-used the same pages on the next allocation cycle. Drop the malloc_trim call entirely (and the now-unused `import ctypes`). 2. IPCMessagePublisher.publish iterated a live set while _write could discard from it When _write was converted from a coroutine to a regular function it began calling self.streams.discard(stream) synchronously on StreamClosedError. publish() was iterating self.streams directly, so a stream that was closed at write time raised RuntimeError: Set changed size during iteration. The exception killed EventPublisher's handle_publish loop, so beacon events (and many other minion-local fire_event payloads) never reached the local subscribers, and salt-call commands like beacons.reset hung until the pytest-shellutils factory timed out. Iterate tuple(self.streams) so _write's discards do not mutate the iteration target.

The MWorkerQueue process RSS climbed unbounded under sustained salt CLI traffic. Three independent libzmq behaviours stacked on top of each other. 1. ``zmq.Context(self.opts["worker_threads"])`` The first argument to ``zmq.Context`` is ``io_threads`` -- the number of background I/O threads libzmq spawns -- not the number of MWorker processes. Each libzmq I/O thread keeps its own message-buffer pool that grows under traffic and is never released. With ``worker_threads: 10`` the proxy process was bleeding ~7-8 MB/min of arena pages purely from that. Drop it to ``zmq.Context(1)``: the QUEUE device proxies two sockets and one I/O thread is plenty. Before/after under heavy stress: ``10 ZMQbg/IO/* threads, ~360 anon mmap regions, 10.5 GB in 3 h`` -> ``1 ZMQbg/IO/0 thread, ~4 regions, ~200 MB after 90 min``. 2. ``LINGER=-1`` on the ROUTER + DEALER ``LINGER=-1`` ("never discard") combined with the salt CLI's one-shot connection pattern (connect, send, recv, disconnect) caused libzmq to retain undelivered queue slots for every disconnected peer forever. Drop to ``LINGER=1000`` so libzmq reaps a peer's queue after 1 s; also enable ``ROUTER_HANDOVER=1`` (replace stale identity entries on reconnect rather than blocking) and explicit ``TCP_KEEPALIVE`` (60 s idle / 15 s interval / 3 probes) so peers that disappear without sending FIN get reaped without waiting on the OS default 2 h timer. 3. ``AsyncReqMessageClient`` opened every REQ socket with no ``ZMQ_IDENTITY`` set libzmq generates a fresh random 4-byte routing-id for each socket, so every salt CLI invocation appeared to the master as a brand-new peer and added one entry to the ROUTER's per-peer routing-id hashtable. Under stress this leaked ~6.4 MB/min linearly even after the changes above. Set a stable identity scoped by ``(role, hostname, uid, pid mod 256)`` so the table is bounded by user/host/concurrency rather than unbounded by total CLI invocations; combined with ``ROUTER_HANDOVER=1`` collisions just trigger handover. After all three the MWorkerQueue RSS is flat at ~56 MB under the same stress workload that previously drove it past 10 GB.

The minion daemon opens multiple AsyncReqMessageClient instances concurrently during startup (auth refresh, pillar fetch, file requests, etc.). With the identity formula scoped by ``(role, host, uid, pid mod 256)`` introduced in 569db36, all of those concurrent REQs share the same identity tuple. Combined with the master ROUTER's ``ROUTER_HANDOVER=1``, every new REQ silently replaced the prior one's connection -- including any reply still in flight to the prior REQ. Minions then failed to converge during startup within the 60s factory timeout in the package-install CI matrix. The CLI-churn case the identity scoping was meant to bound only applies to the salt CLI (``role == "clir"``) where each invocation is its own short-lived process; long-lived daemons that re-use their pid for many parallel REQs need libzmq's default per-connection random routing-ids so concurrent sockets do not collide on identity.

87b7a5b excluded the minion daemon from the stable identity formula but left the syndic and master daemons covered. A syndic master forwarding multiple downstream minions' returns to the upstream master opens several AsyncReqMessageClient instances from a single process at once -- they all shared the same identity, so ROUTER_HANDOVER=1 on the upstream master replaced each previous connection as the next syndic-relayed REQ arrived and silently dropped the in-flight reply. The reproducer is test_syndic_eauth.py::test_root_should_be_able_to_use_comprehensive _targeting (3006leak debian-11 integration zeromq 4); only the downstream minions hosted under the syndic stop returning. The stable identity scoping was always only useful for the salt CLI churn case where each invocation is its own short-lived process. Long-lived daemons -- minion, syndic, master -- all multiplex concurrent REQs over a single process and need libzmq's default per-connection random routing-ids to avoid HANDOVER drops. Test ``not self.opts.get("__role")`` so only CLIs get the bounded identity and every named daemon role falls through.

bdrx312 · 2026-06-24T22:07:06Z

+                if hasattr(self.ckminions.cache, "destroy"):
+                    self.ckminions.cache.destroy()
+                self.ckminions.cache = None
+            self.ckminions = None


Maybe use getattr instead of hasattr to make this slightly less verbose:

ckminions = getattr(self, "ckminions", None) cache = getattr(ckminions, "cache", None) if cache is not None: destroy = getattr(cache, "destroy", None) if destroy is not None: destroy() ckminions.cache = None self.ckminions = None

dwoz requested a review from a team as a code owner May 26, 2026 05:49

dwoz had a problem deploying to ci May 26, 2026 05:50 — with GitHub Actions Error

dwoz added the test:full Run the full test suite label May 26, 2026

dwoz temporarily deployed to ci May 26, 2026 05:50 — with GitHub Actions Inactive

dwoz temporarily deployed to ci May 26, 2026 06:04 — with GitHub Actions Inactive

dwoz temporarily deployed to ci May 26, 2026 06:22 — with GitHub Actions Inactive

Sxderp reviewed May 26, 2026

View reviewed changes

Comment thread salt/utils/timed_subprocess.py

dwoz force-pushed the 3006leak branch from 3518712 to a976f44 Compare May 26, 2026 22:18

dwoz temporarily deployed to ci May 26, 2026 22:18 — with GitHub Actions Inactive

dwoz temporarily deployed to ci May 26, 2026 22:30 — with GitHub Actions Inactive

dwoz temporarily deployed to ci May 26, 2026 22:51 — with GitHub Actions Inactive

dwoz temporarily deployed to ci June 4, 2026 22:21 — with GitHub Actions Inactive

dwoz mentioned this pull request Jun 6, 2026

[BUG] Bug Report: Minion Instability and Resource Exhaustion under High Load #68703

Open

Daniel A. Wozniak and others added 21 commits June 17, 2026 15:38

Fix resource leaks in master, minion, and API

33ad623

Resolve process, file descriptor, and memory leaks by adding explicit teardown and context managers for clients, periodic worker cache resets, and optimizing token listing with os.scandir.

Add nightly stress test workflow and monitoring tools

4691af9

Include a GitHub Actions workflow that performs aggressive stress testing on Master, Minion, and API components, with automated Prometheus-based regression analysis to detect resource leaks.

Fix lint errors in monitoring scripts

23bb5bc

Address pylint warnings in analyze_stats.py and fd_exporter.py related to unused imports, broad exceptions, and resource management.

Capture Prometheus metrics as build artifacts

4017532

Enable post-build introspection by snapshotting the Prometheus data directory and uploading it as a GitHub Action artifact.

Remove __del__ methods from leak fixes

0c3f53d

Explicit resource cleanup is preferred; __del__ methods introduced during memory and file handle leak fixes have been removed from: - MasterMinion - RunnerClient - SaltEvent - WheelClient

Fix file descriptor leak in ZeroMQSocketMonitor

6f1fe5b

Close the underlying PyZMQ monitor socket explicitly in `ZeroMQSocketMonitor.stop()` to prevent eventfd accumulation during repeated Minion connection failures. Fixes test_fd_leak.py

Fix missing explicit teardown for LazyLoaders and bound methods

90dc6a3

Merge remote-tracking branch 'origin/3006.x' into 3006leak

7c5b94b

# Conflicts: # salt/utils/event.py

twangboy approved these changes Jun 23, 2026

View reviewed changes

bdrx312 reviewed Jun 24, 2026

View reviewed changes

dwoz mentioned this pull request Jun 25, 2026

Disconnect non-consuming IPC subscribers to fix publisher memory leak (#68114) #69436

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix process and file descriptor leaks in Salt Master#69227

Fix process and file descriptor leaks in Salt Master#69227
dwoz merged 21 commits into
saltstack:3006.xfrom
dwoz:3006leak

dwoz commented May 26, 2026

Uh oh!

Uh oh!

dwoz commented Jun 8, 2026

Uh oh!

bdrx312 Jun 24, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

dwoz commented May 26, 2026

Uh oh!

Uh oh!

dwoz commented Jun 8, 2026

Uh oh!

bdrx312 Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

bdrx312 Jun 24, 2026 •

edited

Loading